Tuomas Keski-Kuha 6.11.2021
Here’s a link to my IODS-project
Hi!
Nice to be on the Intro to Open Data Science course. I work in the risk management field in the insurance business. I studied applied mathematics and statistics at the University of Helsinki some 10 years ago. I expect to learn some basics of open data science. I found this course on the MOOC courses page.
Looking forward to learning more about R, and my goal is to study R statistics further so I'm able to use it in my work too, and perhaps also as a hobby! :D So starting with happy feelings, but in a bit of a hurry to finish the first week's task.
Tuomas Keski-Kuha 13.11.2021
The data comes from a study whose aim was to examine different variables affecting students' grades. It is a questionnaire-based study.
# first let's read the data:
learning2014 <- read.csv(file = "data/learning2014.csv",
stringsAsFactors = TRUE)
# let's study the data structure etc.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
In the data we have 166 observations (students) and 7 variables. Some of the variables describe the students (gender, age, attitude), while deep, stra and surf are the students' answers in the questionnaire. Points is the exam score and the response variable.
Now let’s dig deeper in to the data, and also get an overview of it:
# ggplot2 is already installed; let's load the library
library(ggplot2)
#summarise data
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
# let's also use library GGally
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# then let's draw a plot where also the possible dependencies can be seen
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
In the summary plot you really get a feel for the data. Gender provides the colouring, so all the information can be compared by gender. The plot shows the distributions of all the variables.
Here are a few remarks on the distributions:
Here are a few remarks on the correlations:
Next, let's apply a linear model to the data. The response or dependent variable (the variable you think shows the effect) here is points, and all the others are predictors or independent variables (the variables you think cause the effect).
Let's choose the three most promising independent variables by looking at the largest correlations with the target variable (points). Attitude, stra and surf have the three largest absolute correlations.
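As a quick sketch (assuming the learning2014 data frame is in the workspace as read above), the correlation-based selection can be checked directly:

```r
# correlations of the numeric variables with points, sorted by absolute value
# (gender is a factor, so it is excluded)
cors <- cor(learning2014[, sapply(learning2014, is.numeric)])["points", ]
sort(abs(cors[names(cors) != "points"]), decreasing = TRUE)
# attitude should come out on top, with stra and surf next
```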
# Let's create a linear regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
We fitted the linear model y = a0 + a1*x1 + a2*x2 + a3*x3. Looking at the results:
Of course the model will fit differently if we ignore just the first of the non-significant variables, but here it seems that a small a2 or a3 would not change the model that much, and those extra parameters could even worsen the model's prediction capability. But let's fit a linear model with two variables (attitude and stra):
# Let's create a linear regression model with two independent variables
my_model_2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model_2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Dropping the surf variable did not change things that much, but at least the intercept a0 got a better fit. Still, the overall results aren't that impressive without surf if you look at the residual standard error and the residual values. So it seems the best thing here would be to drop the stra variable as well (the simpler, the better). The residuals appear to have the same kind of distribution as in the previous model.
The multiple R-squared value (ranging from 0 to 1) gives the proportion of the variance in the response variable (the effect) that can be explained by the predictor variables (the cause). A low value indicates that the variance in the response variable (points) is not explained well by the predictor variables (attitude and stra). And here indeed the R-squared value is not impressive: there is a lot of variability which is not explained by the predictors.
The adjusted R-squared also did not change much between the first and the second model, scoring about 0.2. Adjusted R-squared is handier than multiple R-squared because it takes into account the number of variables in the model and adjusts accordingly (R-squared will increase as you add more variables). So with adjusted R-squared one can really evaluate the difference between models in terms of the variance explained in the response variable.
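The adjusted R-squared values can be pulled out side by side (a small sketch, assuming the model objects my_model and my_model_2 fitted above are in the workspace):

```r
# adjusted R-squared accounts for the number of predictors,
# so it is the fairer figure for comparing the two fits
c(three_predictors = summary(my_model)$adj.r.squared,
  two_predictors   = summary(my_model_2)$adj.r.squared)
# per the summaries above these are 0.1927 and 0.1951
```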
For the fun of it, let's also try fitting a simple line (y = a0 + a1*x1), where x1 is attitude.
# Let's create a linear regression model with just one independent variable
my_model_3 <- lm(points ~ attitude, data = learning2014)
summary(my_model_3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
We observe that the residual standard error became a bit larger even though the parameter fits were more successful. The adjusted R-squared also became lower, which is not good, but is the difference significant? I guess not, if you compare the residual standard errors with those of the previous models with multiple predictive variables.
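One way to make this comparison explicit is a nested-model F-test (a sketch, assuming my_model_2 and my_model_3 fitted above are in the workspace):

```r
# compare the one-predictor model against the two-predictor model;
# the F-test here asks whether adding stra improves the fit significantly
# (it corresponds to the t-test of stra in the larger model's summary)
anova(my_model_3, my_model_2)
```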
First, let's draw a simple regression line on the data, with attitude as the predictive variable:
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# draw the plot
p3
## `geom_smooth()` using formula 'y ~ x'
Let’s draw diagnostic plots and choose the following plots:
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(my_model_3, which = c(1, 2, 5))
Comments on the plots:
Tuomas Keski-Kuha 21.11.2021
# access a few libraries:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
# first let's read the data:
alc <- read.csv(file = "https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
stringsAsFactors = FALSE, sep = ",")
# let's study the data structure etc.
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",~
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",~
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,~
## $ address <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",~
## $ famsize <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE~
## $ Pstatus <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",~
## $ Medu <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,~
## $ Fedu <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,~
## $ Mjob <chr> "at_home", "other", "at_home", "services", "services", "ser~
## $ Fjob <chr> "other", "other", "other", "health", "services", "health", ~
## $ reason <chr> "home", "reputation", "reputation", "course", "reputation",~
## $ guardian <chr> "mother", "mother", "mother", "mother", "other", "mother", ~
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,~
## $ studytime <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,~
## $ schoolsup <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",~
## $ famsup <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", ~
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"~
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ internet <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes~
## $ romantic <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n~
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,~
## $ freetime <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,~
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,~
## $ Dalc <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,~
## $ Walc <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,~
## $ health <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,~
## $ n <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
## $ id.p <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,~
## $ id.m <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,~
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,~
## $ paid <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1~
## $ G1 <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,~
## $ G2 <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14~
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,~
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ paid.p <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2~
## $ G1.p <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,~
## $ G2.p <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,~
## $ G3.p <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,~
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,~
## $ paid.m <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",~
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, ~
## $ G1.m <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1~
## $ G2.m <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ G3.m <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ alc_use <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,~
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,~
## $ cid <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,~
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
Data set information (straight from the web page):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Still, we are going to use the data to predict a binary variable for alcohol consumption (high_use).
The four variables expected to affect alcohol consumption are as follows (the variable picking was biased, as I had seen the results in the DataCamp exercise :D). Anyway, let's pick sex, goout, absences and failures.
Let's draw some plots and tables to see whether there might be any relationship between the target variable (high_use) and the predictors.
# produce summary statistics by group for studying relationship between high_use and sex
alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
# let's draw a plot to study how absences might relate to the target variable
g1 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g1 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
# initialize a plot of high_use related to gout
g2 <- ggplot(data = alc, aes(x = goout, fill = high_use))
# define the plot as a bar plot and draw it
g2 + geom_bar() + ggtitle("Student go-out-tendency by high alcohol consumption")
# let's compute a cross table for examining how high_use relates to failures
table(high_use = alc$high_use, failures = alc$failures)
## failures
## high_use 0 1 2 3
## FALSE 238 12 8 1
## TRUE 87 12 9 3
In this chapter, let's apply a logistic regression model to predict high alcohol use. For the model we use the variables which seemed to have an effect on high alcohol use (sex, goout, absences, failures).
# find the model with glm() (target high_use)
m <- glm(high_use ~ sex + goout + absences + failures, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + goout + absences + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1521 -0.7929 -0.5317 0.7412 2.3706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14389 0.47881 -8.654 < 2e-16 ***
## sexM 0.97989 0.26156 3.746 0.000179 ***
## goout 0.69801 0.12093 5.772 7.83e-09 ***
## absences 0.08246 0.02266 3.639 0.000274 ***
## failures 0.48932 0.23073 2.121 0.033938 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.50 on 365 degrees of freedom
## AIC: 379.5
##
## Number of Fisher Scoring iterations: 4
The model summary seems to tell that the predictive variables fit well overall and are statistically significant (most have a p-value < 0.001; for failures the p-value is 0.034, which is less significant, so it is reasonable to suspect it won't add much prediction power). Also, we cannot be sure there aren't strong correlations between the predictors, so perhaps it would be wise to study those as well.
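A quick look at predictor collinearity could be done like this (a sketch, assuming the alc data frame is loaded as above):

```r
# pairwise correlations between the numeric predictors used in the model;
# values far from zero would suggest collinearity worth worrying about
round(cor(alc[, c("goout", "absences", "failures")]), 2)
```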
Let's also present the coefficients as odds ratios, together with confidence intervals:
# compute odds ratios (OR) round the results
OR <- coef(m) %>% exp %>% round(2)
# compute confidence intervals (CI), round the results
CI <- confint(m) %>% exp %>% round(2)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02 0.01 0.04
## sexM 2.66 1.61 4.49
## goout 2.01 1.59 2.56
## absences 1.09 1.04 1.14
## failures 1.63 1.04 2.59
I did not have time to make adjustments for the intercept term, but luckily its odds ratio is near zero, so it won't matter that much in this case. The sex and goout odds ratios are clearly bigger than one, which indicates they are relevant to the probability of high alcohol use. Failures and absences are also above one, but only slightly, so they do not seem as relevant. The confidence intervals for these values also indicate that they are less significant.
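Note that an odds ratio multiplies the odds, not the probability. As a sketch (assuming the model object m fitted above), the fitted odds for a male student with all other predictors at zero can be converted back to a probability:

```r
# baseline odds times the sexM odds ratio, i.e. exp of the summed coefficients
odds <- exp(coef(m)[["(Intercept)"]] + coef(m)[["sexM"]])
# odds -> probability conversion; a small value, since the other
# predictors are held at zero here
odds / (1 + odds)
```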
Let's use the logistic regression model for predicting high alcohol use from the selected data. We'll calculate the probability of high alcohol use and then form a new variable which is TRUE if the probability is greater than 0.5 and FALSE otherwise. After that we make a cross table of the model predictions versus the actual high_use variable (results of the study), and also a graphical presentation of predictions vs. results.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 16
## TRUE 61 50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
The model gives sensible results. We can see that there are 61 + 16 wrong predictions vs. 243 + 50 correct ones.
Let's also define a loss function and use it to calculate the inaccuracy of the model. The same figure can be calculated from the cross table in the previous paragraph (incorrect cases divided by all cases).
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
round(loss_func(class = alc$high_use, prob = alc$probability), 4)
## [1] 0.2081
So the model is correct on average for 79% of the cases and incorrect for 21%.
It seems that high use is hard to predict (the model gets it wrong 61 times vs. 50 correct), but for low use it's far more precise (16 wrong vs. 243 correct). If you used simple guessing (a coin flip as a model), it would be correct in approx. 50% of the cases, so the model gives reasonable accuracy.
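As a sanity check (a sketch, assuming the cross table and the alc data above), the same error rate falls out of the confusion counts, and a trivial baseline gives a tougher bar than a coin flip:

```r
# training inaccuracy straight from the cross table: incorrect / all cases
(61 + 16) / (243 + 16 + 61 + 50)   # ~0.2081, matching the loss function
# a baseline that always predicts FALSE errs at the rate of TRUE cases
mean(alc$high_use)                 # ~0.30, so the model still beats it
```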
Let’s make a 10 fold cross-validation for validating the logistic regression model we used above:
# set the random seed so we can reproduce exactly the same results
set.seed(123)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross-validation; this is not an adjusted value but the raw inaccuracy rate of the cross-validation
round(cv$delta[1], 4)
## [1] 0.2162
So cross-validation gives a figure really close to the prediction error we had when using all the data for training.
# Let's define the wanted predictor sets (just using the column numbers):
# the first test set has many predictors (almost all of them, more than 40)
testi_1 <- c(1:24, 27, 29:30, 31, 33:48, 50)
# we drop few off in testi_2 predictor set, but it still has quite many (approx 30)
testi_2 <- c(2:24, 29:31, 33:38, 50)
# we drop some more, still keeping the most relevant (approx 20)
testi_3 <- c(2:11, 12, 17, 24, 29:30, 31, 33:35, 50)
# now we keep just 10
testi_4 <- c(2:5, 24, 27, 30, 31, 33, 50)
# the last predictor set has only 3, number 50 is high_use (target variable)
testi_5 <- c(2, 24, 33, 50)
# build a list of the different predictor sets
testi_setti <- list(testi_1, testi_2, testi_3,
testi_4, testi_5)
# initialising the result matrix
tulokset <- matrix(nrow = 2, ncol =5)
# define a loop which takes one testi_setti item (predictor vector) at a time
# and calculates the training error on the whole data set as well as the test
# error from 10-fold cross-validation (this code is adapted from the above)
for (i in 1:5) {
pred_var <- testi_setti[[i]]
# find the model with glm() (target high_use)
m2 <- glm(high_use~., data = alc[pred_var], family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# call loss_func to compute the average number of wrong predictions in the (training) data
train_err <- round(loss_func(class = alc$high_use, prob = alc$probability), 4)
set.seed(123)
# cross-validation
cv <- cv.glm(data = alc[pred_var], cost = loss_func, glmfit = m2, K = 10)
# here we'll have results for cross-validation
test_err <- round(cv$delta[1], 4)
# saving the results to tulokset matrix
tulokset[1,i] <- train_err
tulokset[2,i] <- test_err
}
tulokset
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1757 0.1946 0.2000 0.2162 0.2108
## [2,] 0.2595 0.2838 0.2568 0.2270 0.2162
In the tulokset matrix the first row holds the training errors on the whole data and the second row the test errors from cross-validation. The column 1 model has almost all the predictors one can pick, and we decrease the number of predictors (keeping the most relevant) until the fifth column, which has only three.
It can be noticed that the training error tends to go up when there are fewer predictors in the model. With more predictors the model is more complex and learns the training data better, but on the other hand the test error is very large. So a model with many predictors becomes dependent on the data it was trained on, and it does not work that well when fitted on different cross-validation sets. One could say that a good model with significant predictors is far more robust and simpler, and it passes the testing better.
Tuomas Keski-Kuha 22.11.2021
Exercise 2: Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here.
Let’s see the Boston data in the following:
# access a few libraries:
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
library(GGally)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Data set information
The data is Housing Values in Suburbs of Boston. It has 506 rows and 14 columns with information on the suburbs (towns) of Boston. Here is the link to the data page, where one can see the full details.
Exercise 3: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
Let's visualize the data and take a look at the distributions of the variables and the relations between them.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# First let's reshape the Boston data into long format
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# Then plot every variable, so we'll see the distributions of variables
ggplot(data = melt.boston, aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
There is significant variability in the data, and some variables are heavily skewed towards the minimum or maximum value (most values are at or near the minimum or maximum). There are also a few integer variables in the data, which don't fit the density plots above very well. A few of the variables seem normally distributed.
The correlation plot makes clear that there are really strong correlations (both positive and negative) between variables in the data. In the plot, the colour and size of the circles indicate the correlations between the variables.
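The strongest pairs can also be listed numerically (a sketch, assuming the cor_matrix computed above):

```r
# flatten the correlation matrix into a data frame of variable pairs
cors <- as.data.frame(as.table(cor_matrix))
# keep each pair only once (upper triangle of the matrix)
cors <- subset(cors, as.integer(Var1) < as.integer(Var2))
# the five correlations with the largest absolute value
head(cors[order(-abs(cors$Freq)), ], 5)
```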
Exercise 4: Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
Let’s scale the variables so that the mean = 0 and variance = 1 for all variables in the dataset:
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We can see that all the variables now have negative and positive values around zero.
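The standardization can be verified directly, for example on the crim column (a sketch, assuming boston_scaled as created above):

```r
# after scale(), each column should have mean ~0 and standard deviation 1
c(mean = mean(boston_scaled$crim), sd = sd(boston_scaled$crim))
```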
In the following section we create a new categorical quantile variable from the crim variable, so that the new variable has labels low, med_low, med_high and high depending on the scaled crim value, with each category containing 25% of the observations.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
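To confirm that the quantile cut really produces four classes of roughly equal size, we can tabulate the new variable (a small self-contained sketch rebuilt from the MASS Boston data):

```r
library(MASS)

crim_scaled <- scale(Boston)[, "crim"]
bins <- quantile(crim_scaled)
crime <- cut(crim_scaled, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
# each class should hold roughly a quarter of the 506 towns
table(crime)
```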
After that we split the data into training and test sets by randomly picking 80 % of the rows for the training set and leaving the rest for the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
set.seed(123)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Exercise 5: Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
Let’s fit the LDA model to training data and visualize results.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2500000 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 1.01866313 -0.9066422 -0.08120770 -0.8835724 0.38666334 -0.9213895
## med_low -0.07141687 -0.3429155 0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591 0.2162741 0.19544522 0.3637157 0.12480815 0.4564260
## high -0.48724019 1.0149946 -0.03371693 1.0481437 -0.47733231 0.8274496
## dis rad tax ptratio black lstat
## low 0.9094324 -0.6819317 -0.7458486 -0.4234280 0.3729083 -0.766883775
## med_low 0.3694282 -0.5395408 -0.4205644 -0.1079710 0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959 0.1049654 0.009720124
## high -0.8601246 1.6596029 1.5294129 0.8057784 -0.6383964 0.900379309
## medv
## low 0.47009410
## med_low 0.01548761
## med_high 0.19440747
## high -0.71294711
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.148390645 0.74870532 -1.0874785
## indus 0.040971465 -0.38126652 -0.1619456
## chas 0.002460776 0.03963849 0.1699312
## nox 0.312458150 -0.67564471 -1.3104018
## rm 0.011086947 -0.16058718 -0.1572603
## age 0.283482486 -0.38817624 -0.1013491
## dis -0.079848789 -0.38493775 0.2108038
## rad 3.718978412 0.93123532 -0.4706522
## tax -0.015197127 0.04230505 1.2889075
## ptratio 0.180294774 -0.01592588 -0.3558490
## black -0.136724112 0.02948075 0.1288959
## lstat 0.145739238 -0.37823065 0.3345688
## medv 0.061327205 -0.53906503 -0.1509890
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0364 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The discriminant analysis seems to fit the data quite well, as the plot indicates; the first discriminant LD1 alone captures about 95 % of the between-group variance (proportion of trace 0.9523). The rad variable (index of accessibility to radial highways) and zn (proportion of residential land zoned for lots over 25,000 sq. ft.) seem to have the biggest effect on the classification. In the plot, rad separates the high-crime towns from the rest very clearly, while for the lower-crime classes the separation is much less distinct.
Exercise 6: Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
In the next phase we save the crime categories from the test set and remove the categorical crime variable from the test data. Then we predict the classes with the fitted model and see how accurate it is:
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 10 4 0
## med_low 2 17 6 0
## med_high 1 9 13 2
## high 0 0 1 27
The model’s predictive power seems reasonably good. Most of the results lie on or close to the diagonal, with only a few larger errors where the prediction is two classes away from the correct one. The high class in particular seems “easy” to predict (27 of 28 correct).
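To put a single number on this, one can compute the overall accuracy as the share of the diagonal in the cross table (a sketch using the counts printed above; the exact numbers depend on the random split):

```r
# confusion matrix from the cross-tabulation above (rows = correct, cols = predicted)
conf <- matrix(c(10, 10,  4,  0,
                  2, 17,  6,  0,
                  1,  9, 13,  2,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE)
# overall accuracy: correctly classified / all test observations
sum(diag(conf)) / sum(conf)   # about 0.66
```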
Exercise 7: Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Then we calculate the Euclidean distances between the observations and apply the K-means algorithm to form clusters:
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
Finding the optimal value for the number of clusters:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is where the total WCSS (within-cluster sum of squares) drops radically; here the optimal number for K-means seems to be 2.
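One rough way to read the elbow numerically (an illustrative heuristic, not part of the original analysis) is to look at the relative drop in total WCSS between consecutive values of k; the largest drop typically occurs when moving to k = 2:

```r
library(MASS)
set.seed(123)

boston_scaled <- scale(Boston)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
# relative decrease in WCSS when moving from k clusters to k + 1;
# the largest value suggests where the elbow is
round(-diff(twcss) / head(twcss, -1), 2)
```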
# k-means clustering when k = 2
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston scaled dataset with clusters
pairs(boston_scaled, col = km$cluster)
We can see that perhaps the crim, indus, rad, tax and black variables separate the clusters best, so let's look at the plots with only those variables.
# let's make a vector containing the indices of the best separating variables (crim, indus, rad, tax, black)
a_1 <- c(1, 3, 9, 10, 12)
# plot the Boston scaled dataset with clusters
pairs(boston_scaled[, a_1], col = km$cluster)
These variables seem to have the largest effect on the clustering, as the plot shows a clear separation between the two cluster colours.
Next, we draw a few more plots, showing the values of these variables by cluster in boxplots:
# let's draw boxplots where the significant variable values are presented by cluster
par(mfrow = c(2, 3))
boxplot(boston_scaled$rad ~ km$cluster, main = "Rad by clusters")
boxplot(boston_scaled$tax ~ km$cluster, main = "Tax by clusters")
boxplot(boston_scaled$black ~ km$cluster, main = "Black by clusters")
boxplot(boston_scaled$indus ~ km$cluster, main = "Indus by clusters")
boxplot(boston_scaled$crim ~ km$cluster, main = "Crim by clusters")
Perhaps the boxplots do not make one much wiser: for several of the picked variables (e.g. crim, indus and tax) the distribution within one of the clusters is very concentrated, so the separating power of any single variable can be criticized. But at least we got some more nice pictures.
Bonus exercise: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?
# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(boston_scaled, centers = 5)
Let’s add the clusters to the data and then apply linear discriminant analysis with the K-means clusters as target classes:
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)
set.seed(123)
# linear discriminant analysis
lda.fit_2 <- lda(km.cluster ~ ., data = boston_scaled)
# plot the lda results
plot(lda.fit_2, dimen = 2, col = boston_scaled$km.cluster, pch = boston_scaled$km.cluster)
lda.arrows(lda.fit_2, myscale = 1)
It seems clear that the chas and rad variables are the most influential linear separators for the clusters.
As a final step, we project the training-set predictors onto the LDA solution with a matrix product and draw 3D scatter plots of the projection:
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~train$crime)
# k-means clustering
km_2 <-kmeans(matrix_product, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~km_2$cluster)
I am not entirely sure what the intention with the latter 3D plot was. I applied K-means (k = 4, the same number of classes as in the first plot) to the matrix product data (the projection of the training predictors onto the LDA solution), intending to cluster the projected points. This latter K-means colouring makes the data look very smoothly separated: there are no obvious outliers or points “inside” another group, as there were in the first plot. Looking at the second plot again, it looks much the same as one would get by plotting the predictions of the LDA model.